Stochastic Shared Embeddings: Data-driven Regularization of Embedding Layers
In deep neural nets, lower-level embedding layers account for a large portion of the total number of parameters. Tikhonov regularization, graph-based regularization, and hard parameter sharing are approaches that introduce explicit biases into training in the hope of reducing statistical complexity. As an alternative, we propose stochastically shared embeddings (SSE), a data-driven approach to regularizing embedding layers, which stochastically transitions between embeddings during stochastic gradient descent (SGD). Because SSE integrates seamlessly with existing SGD algorithms, it can be used with only minor modifications when training large-scale neural networks. We develop two versions of SSE: SSE-Graph, which uses knowledge graphs of embeddings, and SSE-SE, which uses no prior information. We provide theoretical guarantees for our method and show its empirical effectiveness on six distinct tasks, from simple neural networks with one hidden layer in recommender systems to the Transformer and BERT in natural language tasks. We find that, when used along with widely used regularization methods such as weight decay and dropout, our proposed SSE can further reduce overfitting, which often leads to more favorable generalization results.
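To make "stochastically transitions between embeddings" concrete, here is a minimal sketch of the SSE-SE idea in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function name `sse_se_indices`, the replacement probability `p`, and the choice to draw the replacement uniformly from the other indices are all assumptions introduced here. The abstract only says that SSE-SE uses no prior information.

```python
import random


def sse_se_indices(indices, num_embeddings, p, rng=None, training=True):
    """Return a copy of `indices` in which each entry is, with probability p,
    swapped for a different index drawn uniformly at random.

    Hypothetical helper for illustration; the uniform choice over the other
    indices is an assumption, not a detail stated in the abstract.
    """
    if num_embeddings < 2:
        raise ValueError("need at least two embeddings to swap between")
    if not training or p <= 0.0:
        # At evaluation time the lookup indices are left untouched.
        return list(indices)
    rng = rng or random.Random()
    out = []
    for i in indices:
        if rng.random() < p:
            # Sample from the N-1 indices other than i by skipping over i.
            j = rng.randrange(num_embeddings - 1)
            if j >= i:
                j += 1
            out.append(j)
        else:
            out.append(i)
    return out


# Usage: perturb the indices of a mini-batch before the embedding lookup.
batch = [3, 7, 7, 0]
noisy_batch = sse_se_indices(batch, num_embeddings=10, p=0.05,
                             rng=random.Random(0))
```

Because only the lookup indices change, the perturbed batch can be passed to an existing embedding layer and optimizer unchanged, which is consistent with the abstract's claim that SSE needs only minor modifications to SGD training.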
Reviews: Stochastic Shared Embeddings: Data-driven Regularization of Embedding Layers
The paper presents a novel and interesting regularization method, a theoretical analysis, and good results. However, I fear its main contributions might be limited to recommendation systems or other fields where knowledge graphs are available or easily constructed, or where, in their absence, it is intuitively reasonable to assume a complete graph. Outside those types of tasks, I did not find the paper's arguments for why other fields or tasks would benefit significantly from such a method intuitively compelling, despite the improved results on some NLP tasks. The simpler version of the regularizer, which assumes a complete graph in the absence of a knowledge graph, permutes embedding indices with a constant*U(1,N) probability. Despite its appealing theoretical properties, it also risks introducing a bias of its own. The results on NLP tasks did not show major improvements, and the paper lacked an explanation of why this type of regularizer would be beneficial and effective for different NLP tasks.
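The relationship the review describes, where the simpler regularizer is the graph-based one run on a complete graph, can be sketched as follows. This is a hypothetical illustration: `sse_graph_indices`, `complete_graph`, the adjacency-dictionary format, and the uniform choice among neighbors are assumptions made here, and the paper's actual transition probabilities are not given in this text.

```python
import random


def sse_graph_indices(indices, neighbors, p, rng=None):
    """With probability p, replace each index by one of its graph neighbors.

    `neighbors` maps an index to a list of related indices (a stand-in for a
    knowledge graph). Indices with no neighbors are never replaced.
    Hypothetical helper; uniform sampling among neighbors is an assumption.
    """
    rng = rng or random.Random()
    out = []
    for i in indices:
        candidates = neighbors.get(i, [])
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(i)
    return out


def complete_graph(n):
    """Adjacency lists in which every index neighbors every other index."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


# With a sparse graph, swaps stay within related items.
sparse = {0: [1], 1: [0], 2: []}
swapped = sse_graph_indices([0, 1, 2], sparse, p=1.0)

# With a complete graph, any other index can be chosen, which is the
# no-prior-information setting the review refers to.
uniform_swapped = sse_graph_indices([0, 1, 2], complete_graph(3), p=1.0,
                                    rng=random.Random(0))
```

Under this reading, the reviewer's concern amounts to asking how much of the benefit comes from the graph structure itself: on the sparse graph, swaps are restricted to related items, while on the complete graph every other item is an equally likely replacement.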
Reviews: Stochastic Shared Embeddings: Data-driven Regularization of Embedding Layers
The paper proposes to integrate a stochastic relabelling embedding operator within the training of a neural net. The reviewers and the area chair are convinced of the merits of the approach which comes with a theoretical justification (smoothing the Rademacher complexity in the uniform case) and solid comparative empirical evidence. The visualization of the embeddings and their interpretation (in supplementary material and in the rebuttal) are appreciated. The AC hopes that the authors will take into account the suggestions/questions in the reviews, specifically concerning the scope of the approach and its limitations, when writing the camera-ready version of the paper. Another question which comes to mind is whether the knowledge graph (e.g. as learned from a teacher network) can facilitate the training of a student network, e.g.
Stochastic Shared Embeddings: Data-driven Regularization of Embedding Layers
Wu, Liwei, Li, Shuqing, Hsieh, Cho-Jui, Sharpnack, James L.